ABSTRACT

Diversified histone modifications (HMs) are essential epigenetic
features. They play important roles in fundamental biological
processes including transcription, DNA repair and DNA replication.
Chromatin regulators (CRs), which are indispensable in epigenetics,
can mediate HMs to adjust chromatin structures and functions. With
the development of ChIP-Seq technology, there is an opportunity to
study CR and HM profiles at the whole-genome scale. However, no
specific resource for the integration of CR ChIP-Seq data or CR-HM
ChIP-Seq linkage pairs is currently available. Therefore, we
constructed the CR Cistrome database, available online at
http://compbio.tongji.edu.cn/cr and http://cistrome.org/cr/, to
further elucidate CR functions and CR-HM linkages. Within this
database, we collected all publicly available ChIP-Seq data on CRs in
human and mouse and categorized the data into four cohorts: the
reader, writer, eraser and remodeler cohorts, together with curated
introductions and ChIP-Seq data analysis results. For the HM readers,
writers and erasers, we provided further ChIP-Seq analysis data for
the targeted HMs and schematized the relationships between them. We
believe CR Cistrome is a valuable resource for the epigenetics
community.

INTRODUCTION 

Nucleosome function and modification represent important epigenetic
features. In eukaryotes, the nucleosome is composed of an octamer of
core histones (two copies each of H2A, H2B, H3 and H4) and 146 base
pairs of DNA wrapped around the histone octamer (1). Histone
modifications (HMs), such as methylation and acetylation, two typical
types of nucleosome modifications, play essential roles in modulating
chromatin structures and functions, making them indispensable in
epigenetic regulation (2-5).

Histone marks tend to occur in an observable pattern 
known as the histone code, which is coded and decoded by 
chromatin regulators (CRs) including readers, writers and 
erasers (2,4,6-19). Readers usually contain specific 
domains that can recognize specific modified histone 
residues, and they determine the modification type 
(e.g. methylation or acetylation) and state (e.g. mono-, 
di- or tri- for lysine methylation) (20). Writers and 
erasers can post-translationally modify and de-modify 
chromatin, adding and removing certain modifications, 
such as methylation and acetylation, to and from some 
specific histone sites, thus altering chromatin structure 
and recruiting regulatory factors (20,21). In addition to 
the factors that are directly related to HMs, chromatin 
remodelers are also regarded as a type of CR (22-24). 
Using energy from ATP hydrolysis, chromatin remodelers can make
nucleosomal DNA easier to access, move nucleosomes to a different
position along the DNA, or remove or exchange nucleosomes (20,21,25).
CRs display vital functions in many common cellular processes, such as
transcription, replication, recombination, apoptosis, differentiation
and development, as well as in some pathologic processes, especially
in cancer (21,26-43).

With increasing attention being paid to CRs and the development of
ChIP-Seq technology, abundant CR ChIP-Seq data and CR-related HM
ChIP-Seq data obtained under the same conditions (i.e. in the same
cell line/type) are now available. Analysis of the linkages between CR
and HM ChIP-Seq data has proven to be an effective method for
revealing new CR functions. EZH2, a subunit of the PRC2 complex, has
been acknowledged as a transcriptional repressor that mediates the
generation of H3K27me3. A recent study by Xu et al. shows a new role
of EZH2 in metastatic prostate cancer (44). By comparing and analyzing
EZH2 and H3K27me3 ChIP-Seq data, they found that a subset of EZH2
peaks is not associated with H3K27me3. Further study confirmed that,
at those peaks, EZH2 acts as a transcriptional activator of androgen
receptor, independently of the other PRC2 subunits and of its known
product H3K27me3.

The integration and presentation of CR ChIP-Seq data and related HM
ChIP-Seq data obtained under the same conditions can contribute
greatly to the study of epigenetics. However, among the relevant
publicly available databases, such as Histome and Factorbook, there is
no specific resource providing linkage analysis of CR and HM ChIP-Seq
data. Histome is a knowledge base that integrates detailed information
about all human HM sites and their related writers and erasers;
however, it has not associated CRs and HMs with ChIP-Seq data (45).
Factorbook is a wiki-based database collecting all of the TF ChIP-Seq
data from human generated by ENCODE, together with additional
downstream analysis, and it does not specifically focus on linkage
pairs between CR and HM (46). This situation has driven us to develop
CR Cistrome, a unique knowledgebase integrating curated information on
36 CRs, 194 qualified CR ChIP-Seq data sets and 177 qualified HM
ChIP-Seq data sets, and analysis of the relationship between 458 pairs
of CRs and HMs in human and mouse. The CRs with related HMs are
restricted to chromatin readers, writers and erasers, as remodelers
possess no related HMs. We believe this database represents a valuable
resource for systematically examining the genome-wide functions of CRs
and that it may motivate investigators who are interested in
epigenetics.



CONSTRUCTION AND CONTENTS 

Data sources 

CR information was derived from different sources. The list of readers
was acquired from Yun et al. (20), and the reader-recognized HMs were
summarized through manual literature mining. Writers, erasers and
their related HMs were obtained from the Histome database (45). The
list of remodelers was obtained from Bao et al. (47). The names of all
of these CRs are consistent with those in the Cistrome Map (23), a
database we previously constructed, containing all the articles
involving human and mouse ChIP-Seq data. Furthermore, for each CR, CR
Cistrome provides its aliases, which are based on the NCBI Gene
database. In addition, we manually collected detailed and curated
information, including summaries, functions, interactions and known
disease associations. The information on the ChIP-Seq data for CRs and
related HMs, including the species, cell line/population, cell type,
tissue origin and GSE and GSM accession numbers, was derived from the
Cistrome Map, and the raw ChIP-Seq data were downloaded in the fastq
format from GEO at NCBI, from EBI and from ENCODE via the UCSC Genome
Browser.

Database contents 

For each collected CR, CR Cistrome provides three layers of contents,
as shown in Figure 1. The first layer provides information including
the CR's introduction in other public databases (NCBI, UniProt,
Wikipedia and GeneCards are included), its full name and aliases, type
(writer, eraser, reader or remodeler), manually curated function and
known associated diseases as well as a CR summary.

In this database, the publicly available ChIP-Seq data on each CR from
human and mouse were collected and processed, and the results are
shown as the second layer of content. In this layer, the peak file
(.bed) generated through MACS (48) and the reads density file (.bw)
obtained from bedGraphToBigWig (49) are provided for free download. In
addition, several annotation results are displayed and can be freely
downloaded, such as the binding DNA sequence (motif) acquired through
MDSeqPos, the average conservation profile across the CR's peaks, the
average profile near the transcription start site (TSS), near the
transcription terminal site (TTS) and through the gene body, and the
genome-wide enrichment as indicated by CEAS (50).

The third layer is specifically aimed at CR-HM linkage pairs. For the
CR-HM linkages presented here, a Reader-HM linkage was defined as a
reader and its recognized HM as obtained from the literature, whereas
the writer- and eraser-related HMs were taken from the Histome
database. For each CR (readers, writers and erasers), if ChIP-Seq data
for the related HMs were available from the same cell line/type (or,
if data from the same cell line/type were not available, the species
and tissue origin were considered), the results of the analysis of the
linkage pair between the CR and the related HM are presented,
including the Venn diagram between them, the distribution of their
overlap peaks, the average CR and HM profile in the CR's binding sites
and the reads density plot of the CR and HM in the CR's binding sites.
Furthermore, the results of the analysis of the related HM ChIP-Seq
data, including the profile near the TSS, near the TTS and through the
gene body as well as the observed genome-wide enrichment and
conservation and the freely downloadable peak file and reads density
file, are contained in this layer.
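
The matching rule described above (same cell line/type first, then
species and tissue origin) can be sketched as follows. This is only an
illustration of the rule as stated in the text, not the code behind CR
Cistrome; the Dataset record, its field names and the function name
are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical metadata record for one ChIP-Seq data set."""
    accession: str
    factor: str      # CR or HM name, e.g. "PHF8" or "H3K4me3"
    species: str
    cell_line: str
    tissue: str


def match_hm_datasets(cr, hm_datasets, related_hm):
    """Select HM data sets to pair with one CR data set.

    Data sets from the same species and cell line are preferred; if
    none exist, fall back to the same species and tissue origin.
    """
    candidates = [d for d in hm_datasets if d.factor == related_hm]
    same_line = [d for d in candidates
                 if d.species == cr.species and d.cell_line == cr.cell_line]
    if same_line:
        return same_line
    return [d for d in candidates
            if d.species == cr.species and d.tissue == cr.tissue]
```

An empty result means that no linkage pair can be formed for that CR
data set under this rule.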

Figure 1. Content of the CR Cistrome database. For each CR collected,
we provide basic information and ChIP-Seq data analysis results. For
each reader, writer and eraser (if there are available ChIP-Seq data
for the related HMs from the same cell line/pop, or the cell type,
when the same cell line is not available), we also provide the
ChIP-Seq analysis results for these HMs, including the same results as
provided for the CR data, except that HMs do not have motif scan
results. Furthermore, the resultant Venn diagram, the genomic
distribution of overlap peaks between CR and HM, the average CR and HM
profile in the CR's binding sites and the reads density plot of CR and
HM in the CR's peaks are shown to illustrate the relationship between
the CRs and related HMs.

To guarantee the quality of the ChIP-Seq data in the database, we set
criteria (total sequencing reads >5 M and detected peaks >500), and
only data sets that met these criteria were added. As a result, 36 CRs
associated with 194 ChIP-Seq data sets from human and mouse were
collected in CR Cistrome; the detailed statistics of these data are
shown in Table 1, and those of the HM ChIP-Seq data are listed in
Table 2. The database includes 13 pairs and 165 data sets for
Writer-HM linkages, 12 pairs and 171 data sets for Eraser-HM linkages
and 5 pairs and 122 data sets for Reader-HM linkages (Table 3). All of
the analysis results and information described above are stored and
managed through the MySQL relational database management system on a
Linux server.
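
The quality criteria stated in the text (total sequencing reads >5 M
and detected peaks >500) amount to a simple threshold filter. The
sketch below is a minimal illustration; the function and constant
names are assumptions and do not come from the CR Cistrome code base.

```python
MIN_TOTAL_READS = 5_000_000  # "total sequencing reads >5 M"
MIN_PEAKS = 500              # "detected peaks >500"


def passes_quality_control(total_reads: int, peak_count: int) -> bool:
    """Return True only if both thresholds are strictly exceeded."""
    return total_reads > MIN_TOTAL_READS and peak_count > MIN_PEAKS
```

Both comparisons are strict, following the ">" signs used in the text.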

CR Cistrome is a part of the Cistrome Project (51). To make it
convenient for users to further analyze our CR and HM data (.bed and
.bw), we have added an interface between CR Cistrome and
Galaxy/Cistrome (http://cistrome.org/ap/). Users can either import CR
Cistrome data (.bed and .bw) from the 'Import Data' drop-down menu in
the CISTROME TOOLBOX or send the data to Galaxy/Cistrome by choosing
the SEND TO CISTROME function in CR Cistrome. They can then use the
CISTROME TOOLS and GALAXY TOOLBOX of the Galaxy/Cistrome webserver to
explore the data.

UTILITY AND DISCUSSION

Interface and visualization 

Our database provides two different methods for users to survey the
ChIP-Seq data sets. One is through the regulator atlas and the other
is based on the advanced search menu. The 'regulator atlas' can
display all of the CR ChIP-Seq data sets from each cohort, and users
can survey one data set in one cell type at a time, whereas using the
advanced search, they can examine a single CR in different cell lines,
cell types, tissues and species at the same time. To make the data
presentation more intuitive, the statistics on the cell lines and cell
types from which the ChIP-Seq data were obtained are also shown in the
'collection stats' menu.

Case exploration 

If the user is interested in a specific CR, they can select the name,
cell lines (or cell types, tissues) and species of the CR ChIP-Seq
data in the advanced search menu (if the cell lines, cell types,
tissues and species are not set, the database will return all the
ChIP-Seq data for this CR). In this article, we use PHF8, an eraser of
H3K9me2/3 and H4K20me1 and a reader of H3K4me3, as an example of the
exploration procedure.

Step 1

Assuming the user is interested in the ChIP-Seq data on PHF8 in human
fibroblasts, they can select PHF8 in human fibroblasts in the search
menu, as shown in the first step of Figure 2. The database will then
return a page containing the manually curated information for PHF8,
its ChIP-Seq data information and the related HM ChIP-Seq data
information for human fibroblasts. Two PHF8 ChIP-Seq data sets are
available for human fibroblasts (GSE20753, GSM520383 and GSE20753,
GSM520384), and the peak file (.bed), the read density file (.bw) and
the analysis results can be freely downloaded from this page.

Table 1. The statistics of CR ChIP-Seq data sets

CR type     CR      CR name                            ChIP-Seq data set
            number                                     number (human/mouse)

Reader      5                                          18/7
Writer      11      CREBBP, Ep300, EZH1, EZH2,         71/21
                    KAT7, PCAF, SETDB1, WHSC1,
                    WDR5, RAC3, KAT2A
Eraser      14      PHF8, KDM5B, KDM5A, KDM2A,         49/10
                    KDM1A, HDAC6, HDAC3, HDAC2,
                    HDAC1, SIRT6, SIRT1, KDM5C,
                    KDM4A, KDM6B
Remodeler   9       CHD1, CHD2, CHD4, CHD7, MTA3,      26/14
                    SMARCA4, SMARCB1, SMARCC1,
                    SMARCC2
Total       36                                         146/48

After quality control, we obtained 5 readers with 18 ChIP-Seq data
sets in human and 7 in mouse, 11 writers with 71 ChIP-Seq data sets in
human and 21 in mouse, 14 erasers with 49 ChIP-Seq data sets in human
and 10 in mouse, and 9 remodelers with 26 ChIP-Seq data sets in human
and 14 in mouse; in total, 146 ChIP-Seq data sets in human and 48 in
mouse.

Table 2. The statistics of HM ChIP-Seq data sets

HM type       HM      HM name                          ChIP-Seq data set
              number                                   number (human/mouse)

Methylation   7       H3K4me1, H3K4me2, H3K4me3,       96/25
                      H4K20me1, H3K9me3, H3K27me3,
                      H3K36me3
Acetylation   6       H3K9ac, H3K56ac, H3K27ac,        45/11
                      H3K18ac, H4K8ac, H4K5ac
Total         13                                       141/36

After quality control, we obtained 7 kinds of histone methylation with
96 ChIP-Seq data sets in human and 25 in mouse, and 6 kinds of
acetylation with 45 ChIP-Seq data sets in human and 11 in mouse.

Table 3. The statistics of CR-histone linkage data set pairs

CR-HM link type   CR-HM number   ChIP-Seq data set number

Writer-HM         13             165
Eraser-HM         12             171
Reader-HM         5              122
Total             30             458

There are 13 pairs and 165 data sets of Writer-HM linkage, 12 pairs
and 171 data sets of Eraser-HM linkage, and 5 pairs and 122 data sets
of Reader-HM linkage.

Step 2

If the user wishes to acquire a detailed analysis of the results from
the second ChIP-Seq data set (GSE20753, GSM520384), they can follow
the second step shown in Figure 2. This action will provide a page
containing the following analysis results: (i) a brief summary of this
PHF8 data set (Figure 2A); (ii) the top three enriched DNA binding
motifs in the genomic regions bound by PHF8 in fibroblasts
(Figure 2B); (iii) the average ChIP-Seq signal profile near the TSS
(Figure 2C), near the TTS (Figure 2D) and across the gene body
(Figure 2E); (iv) the genomic distribution of PHF8 ChIP-Seq peaks
(Figure 2F); and (v) the average conservation profile across PHF8
ChIP-Seq peaks (Figure 2G).

(1) Modern high-throughput sequencers can generate tens of millions of
sequences in a single run. A summary of these raw ChIP-Seq data is
presented in Figure 2A. Bowtie (52) was used to align short DNA
sequence reads to the genomes. Here, 'total reads' means all of the
reads sequenced in a single ChIP-Seq experiment, which indicates the
resolution, whereas 'mappable reads' means reads that align to the
genomes with at most two mismatches allowed. Next, the mappable reads
are used to find the peaks using MACS. 'Total peaks' means the number
of regions in which the factor is enriched under the Q-value cutoff
(0.01). Here, this PHF8 data set includes 60 379 420 total reads,
33 694 178 mappable reads and 5128 total peaks, suggesting that this
data set is of good quality.
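
As a small worked example of reading the summary in Figure 2A, the
fraction of mappable reads can be computed directly from the two
numbers quoted for this PHF8 data set. The function name is
hypothetical, and the mappable fraction itself is not a quantity
reported by the database; it is shown here only to make the
relationship between the two read counts concrete.

```python
def mappable_fraction(total_reads: int, mappable_reads: int) -> float:
    """Fraction of sequenced reads that aligned to the genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mappable_reads / total_reads


# Read counts quoted in the text for GSE20753, GSM520384.
fraction = mappable_fraction(60_379_420, 33_694_178)
print(f"mappable fraction: {fraction:.1%}")  # about 55.8%
```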

(2) Sequence motifs are often defined as sequence-specific binding
sites for proteins such as nucleases and transcription factors (TFs).
They are usually short, recurring DNA sequences and are believed to
have biological functions. MDSeqPos is an internal laboratory software
platform used for de novo motif detection and known motif detection,
applied to the top 1000 peaks sorted by the Q-value. Figure 2B shows
the top three motif detection results, including the sequence logo,
the z-score and the factor name and position. In this case, the motifs
of additional factors (BRF1, BDP1 and ZNF711) were enriched in the top
1000 peaks obtained for PHF8, suggesting that there may be co-binding
between them. Further information, including the expression of these
factors, can be reached in the Transcription Factor Encyclopedia (TFe)
(http://www.cisreg.ca/cgi-bin/tfe/home.pl) through the hyperlink.

(3) Through CEAS (50), biologists can visualize the average ChIP
signal profile over specific genomic features, such as the TSS
(Figure 2C), the TTS (Figure 2D) and the gene body (Figure 2E). The
profile near the TSS (TTS) covers the 3000 bp upstream and downstream
of the TSS (TTS), whereas the average profile across the gene body
extends from 1000 bp upstream of the TSS to 1000 bp downstream of the
TTS (all meta genes were divided into 3000 bins). Here, a strong peak
can be seen near the TSS, indicating that PHF8 is enriched and
functions in promoter regions in fibroblasts.

Figure 2. Screenshot depicting an example run of the CR Cistrome
database. Assuming the user is interested in PHF8 in human
fibroblasts, step 1 will return a page containing the basic
information on PHF8, including its alias, introduction, functions and
disease associations, which are manually curated from the literature,
together with external links to NCBI, UniProt, Wikipedia and
GeneCards, as well as the PHF8 ChIP-Seq data information and the
ChIP-Seq data information for the related HMs in human fibroblasts.
Step 2 will provide a page containing the top three motifs for PHF8 in
human fibroblasts and the average profile near the TSS and TTS as well
as the average gene profile, genome enrichment, average conservation
profile across PHF8 ChIP-Seq peaks and a brief data summary, which can
help indicate the general quality of this data set. Step 3 will
provide the PHF8 data set ID and H3K4me3 data set ID for each
comparison as well as the Venn diagram, the distribution of their
overlap peaks, the average PHF8 and H3K4me3 ChIP-Seq signal profile in
PHF8's binding sites and the reads density plot of PHF8 and H3K4me3 in
PHF8's binding sites.

Nucleic Acids Research, 2014, Vol. 42, Database issue D455

(4) The orange bar in Figure 2F represents the genomic distribution of
PHF8 ChIP-Seq peaks across the following five regions: the promoter,
downstream, coding exon, intron and distal intergenic regions. Based
on comparison with the genomic distribution generated by chance (the
blue bar), we can see that PHF8 is clearly enriched in the promoter
region, which is consistent with the average ChIP profile near the TSS
(Figure 2C).

(5) The average conservation profile across the PHF8 ChIP-Seq peaks
(Figure 2G) is also presented. Conserved sequences are similar or
identical sequences across different species, and highly conserved
sequences tend to be biologically functional. Here, we focused on the
500 bp (for broad HM peaks, we set it to 2000 bp) upstream and
downstream of the summit of each PHF8 peak. The conservation score was
much higher in the middle than in the surroundings, indicating that
the middle part is more conserved and functional than the
surroundings.

Step 3 

As PHF8 is a reader of H3K4me3, and there are also ChIP-Seq data for
H3K4me3 in human fibroblasts, figures representing a detailed analysis
between PHF8 and H3K4me3 can be obtained through the third step. Here,
we list the PHF8 data set ID and H3K4me3 data set ID for each compared
pair as well as providing (i) the Venn diagram between PHF8 and
H3K4me3 (Figure 2H) and the distribution of their overlap peaks
(Figure 2I), (ii) the average PHF8 and H3K4me3 ChIP-Seq signal profile
in PHF8's binding sites (Figure 2J) and (iii) the reads density plot
of PHF8 and H3K4me3 in PHF8's binding sites (Figure 2K).

(1) The Venn diagram (Figure 2H) shows the overlap of the PHF8 and
H3K4me3 peaks; the red circle represents all PHF8 peaks; the blue
circle represents all H3K4me3 peaks; and the overlap represents the
shared peaks between PHF8 and H3K4me3. Here, PHF8 and H3K4me3
overlapped greatly, indicating that they are functionally related. The
orange bar in Figure 2I shows where the overlap peaks in Figure 2H are
enriched throughout the genome, including the promoter region,
downstream region, coding exon region, intron region and distal
intergenic region, whereas the blue bar shows the distribution
generated by chance. Here, the overlap peaks for PHF8 and H3K4me3 are
enriched in the promoter region compared with the by-chance
distribution.
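
The counts behind a Venn diagram such as Figure 2H can be derived from
two peak lists by testing interval overlap. The sketch below is a
minimal pure-Python illustration. The text does not state how overlap
was defined in CR Cistrome, so treating any shared base pair as an
overlap is an assumption, as are the function names.

```python
from collections import defaultdict


def count_overlapping(query_peaks, target_peaks):
    """Count query peaks sharing at least 1 bp with any target peak.

    Peaks are (chrom, start, end) tuples with half-open coordinates,
    as in BED files.
    """
    targets_by_chrom = defaultdict(list)
    for chrom, start, end in target_peaks:
        targets_by_chrom[chrom].append((start, end))
    shared = 0
    for chrom, start, end in query_peaks:
        if any(start < t_end and t_start < end
               for t_start, t_end in targets_by_chrom[chrom]):
            shared += 1
    return shared


def venn_counts(cr_peaks, hm_peaks):
    """Return shared and exclusive peak counts for both peak sets."""
    cr_shared = count_overlapping(cr_peaks, hm_peaks)
    hm_shared = count_overlapping(hm_peaks, cr_peaks)
    return {
        "cr_only": len(cr_peaks) - cr_shared,
        "cr_shared": cr_shared,
        "hm_only": len(hm_peaks) - hm_shared,
        "hm_shared": hm_shared,
    }
```

The shared count is reported from each side separately because one
broad HM peak may cover several CR peaks, so the two numbers need not
agree.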

(2) The average PHF8 and H3K4me3 ChIP-Seq signal profile observed
within the PHF8 peaks (Figure 2J) provides the reads density of PHF8
and H3K4me3 in the PHF8 peaks. Here, we focus on the 1000 bp upstream
and downstream of each PHF8 peak summit, calculate the PHF8 and
H3K4me3 read density every 50 bp, obtain 40 reads density values and
connect them with a line. The red line and the right red y-axis
represent the reads density of PHF8, and the black line and the left
black y-axis represent the reads density of H3K4me3. Here, H3K4me3 is
enriched around the PHF8 peak summits.
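
The binning described above (1000 bp on each side of the summit,
50 bp bins, 40 values) can be sketched as follows. This is an
illustration under stated assumptions rather than the database's own
code: each read is represented by a single genomic position, the value
per bin is the raw read count averaged over all summits, and no
further normalization is applied, since the text does not specify one.

```python
from bisect import bisect_left

FLANK = 1000                # bp on each side of the summit (from the text)
BIN_WIDTH = 50              # bp per bin (from the text)
N_BINS = 2 * FLANK // BIN_WIDTH  # 40 values per profile


def average_profile(summits, read_positions):
    """Average per-bin read counts around peak summits.

    summits: list of (chrom, position) tuples.
    read_positions: dict mapping chrom to a sorted list of positions.
    """
    totals = [0] * N_BINS
    for chrom, summit in summits:
        positions = read_positions.get(chrom, [])
        window_start = summit - FLANK
        for i in range(N_BINS):
            lo = window_start + i * BIN_WIDTH
            hi = lo + BIN_WIDTH
            # Number of reads with lo <= position < hi.
            totals[i] += bisect_left(positions, hi) - bisect_left(positions, lo)
    if not summits:
        return [0.0] * N_BINS
    return [total / len(summits) for total in totals]
```

Calling the function once with the CR reads and once with the HM
reads, using the same CR summits, gives the two curves of a plot such
as Figure 2J.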

(3) The reads density plot for PHF8 and H3K4me3 within PHF8 peaks
(Figure 2K) also provides the H3K4me3 reads density among PHF8 peak
regions. It is generated to reflect the reads density of a given HM
among the binding sites for a given CR. Each dot refers to one CR
binding site, which has been trimmed to 150 bp upstream and downstream
of the peak center. The x-axis (y-axis) value of each dot stands for
the CR's (HM's) read density in this CR binding site, that is, the
number of CR (HM) ChIP-Seq reads in this binding site normalized by
the binding site length (300 bp) and then log10-transformed. We
produced an image scatter plot of the two data sets in which the
colors indicate the density of the points in the scatter plot. Here,
H3K4me3 is enriched in the PHF8 peak regions.
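
The coordinates of one dot in such a plot follow from the read counts
in the 300 bp site. The sketch below is illustrative only: the text
does not say how sites with zero reads were handled, so the pseudocount
is an assumption introduced here to keep the logarithm defined, and
the function names are hypothetical.

```python
import math

SITE_HALF_WIDTH = 150                 # bp on each side of the peak center
SITE_LENGTH = 2 * SITE_HALF_WIDTH     # 300 bp binding site


def log10_read_density(read_count: int, pseudocount: float = 1.0) -> float:
    """log10 of reads per bp in a 300 bp site.

    The pseudocount is an assumption made for this sketch; it avoids
    taking log10(0) for sites without reads.
    """
    return math.log10((read_count + pseudocount) / SITE_LENGTH)


def scatter_point(cr_reads: int, hm_reads: int):
    """(x, y) coordinates of one CR binding site in the density plot."""
    return log10_read_density(cr_reads), log10_read_density(hm_reads)
```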

CR CISTROME SUMMARY 

CR Cistrome is a ChIP-Seq database containing information on CRs and
CR-HM linkages in human and mouse. It comprises all qualified CRs with
available public ChIP-Seq data, manually curated information on these
CRs, including their full names, aliases, introductions, functions,
known disease associations and CR type, as well as the ChIP-Seq data
analysis results. This database also provides the related HMs'
ChIP-Seq data analysis and CR-HM linkage analysis results for readers,
writers and erasers in cases where there are available HM ChIP-Seq
data collected under the same condition as the associated CRs. Each CR
is linked to NCBI, UniProt, Wikipedia and GeneCards to provide users
with alternative information. CRs can be surveyed through either the
advanced search options or the regulator atlas menu. This database
will be useful for different users. Individuals who are interested in
epigenetic mechanisms can easily acquire the features of the CR
ChIP-Seq data and the associations between CR and HM ChIP-Seq data.
Advanced users can conveniently download the processed ChIP-Seq data,
and users who generate CR ChIP-Seq data themselves can achieve a
better comparison and integration with the public ChIP-Seq data
through our database.



FUTURE DEVELOPMENTS 

We will pay close attention to any updated ChIP-Seq data for our
collected CRs and HMs, and we will process them and add the results to
the database as quickly as possible.

AVAILABILITY AND REQUIREMENTS 

CR Cistrome is available at http://compbio.tongji.edu.cn/cr and
http://cistrome.org/cr/. Although we recommend Safari as the default
web browser, this database also supports other standard web browsers.
TOOLS FOR ANALYZING CHIP-SEQ DATA

We have provided all the tools and parameters we used for 
analyzing data in the FAQ part of our database. 

Bowtie 

Bowtie is an ultrafast, memory-efficient tool for quickly mapping
large numbers of short DNA sequences (reads) to large genomes (52).
Our query input files were FASTQ files, and the alignments were output
in the SAM format. If a read had more than one reportable alignment,
we retained only one alignment and suppressed all the others.

Samtools 

Samtools is a set of tools that processes alignments in the 
BAM format (53). Here, we used Samtools to convert 
SAM files into compressed BAM files and to merge 
samples that are replicates. 

MACS 

Calling peaks is the main function of MACS (model-based analysis of
ChIP-Seq), which is used for identifying TF binding sites (48). It is
a powerful analysis method for ChIP-Seq data. We used 0.01 as the
Q-value cutoff, as this represents a stringent standard and can
generate peaks with a higher confidence level. In the case of
duplicate tags at the same location, we retained at most one tag
because this can improve prediction accuracy given the same complexity
of the ChIP-Seq library. To allow all of the ChIP-Seq data in this
database to be compared consistently, we processed all of the ChIP-Seq
data without building a shifting model and used 73 bp as the shift
size. When the experimental design included two biological replicates,
we only generated the merged peak file (.bed) and read density file
(.bw). If the data set possessed control data files ('Input DNA' or
'IgG control'), the binding site prediction preferentially used the
control data files as the background; otherwise, MACS randomly sampled
the genome as a control. When there were >10 000 peaks for a data set,
we only used the top 10 000 for further analysis.
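
Two of the rules above, keeping at most one tag per location and
restricting downstream analysis to the top 10 000 peaks, can be
illustrated in a few lines. This sketch does not call MACS and is not
its implementation. Treating a location as the combination of
chromosome, position and strand, and ranking peaks by Q-value, are
assumptions made here; the text does not state the ranking criterion
for the top 10 000.

```python
def keep_one_tag_per_location(tags):
    """Drop duplicate tags, keeping the first tag seen at each location.

    tags: iterable of (chrom, position, strand) tuples.
    """
    seen = set()
    kept = []
    for tag in tags:
        if tag not in seen:
            seen.add(tag)
            kept.append(tag)
    return kept


def top_peaks(peaks, limit=10_000):
    """Return up to `limit` peaks with the smallest Q-values.

    peaks: list of dicts that each carry a 'q_value' entry
    (a hypothetical record layout used only in this sketch).
    """
    return sorted(peaks, key=lambda peak: peak["q_value"])[:limit]
```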

BEDTools 

BEDTools is a set of utilities for addressing common genomics tasks
(54). We used its intersectBed function to find the overlap between
the bdg files generated by MACS and the chromosome length bed file, in
case a region reported by MACS extended beyond the end of a
chromosome.
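
The effect of intersecting the bdg intervals with a chromosome length
file is to clip every interval to the bounds of its chromosome. The
following pure-Python stand-in illustrates that effect; it does not
invoke BEDTools, and the function name and tuple layout are
assumptions made for this sketch.

```python
def clip_to_chromosome(intervals, chrom_lengths):
    """Clip bedGraph-style intervals to chromosome boundaries.

    intervals: list of (chrom, start, end, value) tuples, half-open.
    chrom_lengths: dict mapping chrom to its length in bp.
    Intervals on unknown chromosomes, or lying entirely outside the
    chromosome, are dropped.
    """
    clipped = []
    for chrom, start, end, value in intervals:
        length = chrom_lengths.get(chrom)
        if length is None:
            continue
        start = max(start, 0)
        end = min(end, length)
        if start < end:
            clipped.append((chrom, start, end, value))
    return clipped
```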

bedGraphToBigWig 

bedGraphToBigWig is a UCSC tool (49). Here, we used it 
to convert a bdg file into a bw file to reduce the storage 
burden. 

CEAS 

CEAS (cis-regulatory Element Annotation System) is a tool for
providing statistics on ChIP enrichment at important genome features,
such as specific chromosomes and promoters. Here, we used it to
generate average ChIP enrichment signals over specific genomic
features, including the TSS and TTS, as well as gene profiles and
genome enrichment. For each factor, we used the top 5000 peaks to
analyze the distribution of cis-regulatory elements.